Road Traffic crash data are useful tools to support the development and implementation of road safety programs that leads to reduced road traffic accidents and in turn saves people’s lives. This report represents my analysis for the Victoria road crashes from July 2014 to June 2019. Through the various data transformation and visualisation tools I will look for data insights and share my observation with Victoria Department of Transport.

Problem description and questions

This dataset documents crashes on Victorian roads during the five year reporting period from July 2014 to June 2019. This data gives me the opportunity to analyse Victorian fatal and injury crash data based on time, locations, conditions, crash type, road user type etc. For this analysis I will mainly focus on finding answers to the following questions by means of exploration and visualisation.

The outcome of the analysis would guide Victoria Department of Transport manage their road safety campaign more efficiently and target investment and resources in the core areas of need and concern.

Data explanation

# importing necessary libraries
library(tidyverse)
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'hms'
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(visdat)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# reading data from csv file
crashes_data <- read_csv("Road_Crashes_for_five_Years_-_Victoria.csv")
## Rows: 77513 Columns: 65── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (29): ACCIDENT_NO, ABS_CODE, ACCIDENT_STATUS, ALCOHOLTIME, ACCIDENT_TYP...
## dbl  (34): X, Y, OBJECTID, NODE_ID, LONGITUDE, LATITUDE, VICGRID_X, VICGRID_...
## date  (1): ACCIDENT_DATE
## time  (1): ACCIDENT_TIME
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# changing variable names to lower cases for better legibility
names(crashes_data) <- tolower(names(crashes_data))
# returning first six rows 
head(crashes_data)
## # A tibble: 6 × 65
##          x        y objectid accident_no  abs_code accident_status accident_date
##      <dbl>    <dbl>    <dbl> <chr>        <chr>    <chr>           <date>       
## 1 2498217. 2454870.        1 T20140013514 ABS to … Unfinished      2014-07-01   
## 2 2502935. 2423005.        2 T20140013549 ABS to … Finished        2014-07-02   
## 3 2520154. 2395622.        3 T20140013561 ABS to … Finished        2014-07-02   
## 4 2473719. 2402357.        4 T20140013563 ABS to … Finished        2014-07-02   
## 5 2484074. 2409124.        5 T20140013567 ABS to … Finished        2014-07-02   
## 6 2496691. 2406552.        6 T20140013582 ABS to … Finished        2014-07-02   
## # … with 58 more variables: accident_time <time>, alcoholtime <chr>,
## #   accident_type <chr>, day_of_week <chr>, dca_code <chr>, hit_run_flag <chr>,
## #   light_condition <chr>, police_attend <chr>, road_geometry <chr>,
## #   severity <chr>, speed_zone <chr>, run_offroad <chr>, node_id <dbl>,
## #   longitude <dbl>, latitude <dbl>, node_type <chr>, lga_name <chr>,
## #   region_name <chr>, vicgrid_x <dbl>, vicgrid_y <dbl>, total_persons <dbl>,
## #   inj_or_fatal <dbl>, fatality <dbl>, seriousinjury <dbl>, …
# printing structure of data
str(crashes_data)
## spec_tbl_df [77,513 × 65] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ x                : num [1:77513] 2498217 2502935 2520154 2473719 2484074 ...
##  $ y                : num [1:77513] 2454870 2423005 2395622 2402357 2409124 ...
##  $ objectid         : num [1:77513] 1 2 3 4 5 6 7 8 9 10 ...
##  $ accident_no      : chr [1:77513] "T20140013514" "T20140013549" "T20140013561" "T20140013563" ...
##  $ abs_code         : chr [1:77513] "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" ...
##  $ accident_status  : chr [1:77513] "Unfinished" "Finished" "Finished" "Finished" ...
##  $ accident_date    : Date[1:77513], format: "2014-07-01" "2014-07-02" ...
##  $ accident_time    : 'hms' num [1:77513] 13:09:00 00:01:00 08:30:00 10:08:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ alcoholtime      : chr [1:77513] "No" "Yes" "No" "No" ...
##  $ accident_type    : chr [1:77513] "Collision with vehicle" "Collision with vehicle" "collision with some other object" "Collision with vehicle" ...
##  $ day_of_week      : chr [1:77513] "Monday" "Wednesday" "Wednesday" "Wednesday" ...
##  $ dca_code         : chr [1:77513] "RIGHT THROUGH" "U TURN" "STRUCK OBJECT ON CARRIAGEWAY" "CROSS TRAFFIC(INTERSECTIONS ONLY)" ...
##  $ hit_run_flag     : chr [1:77513] "No" "No" "No" "Yes" ...
##  $ light_condition  : chr [1:77513] "Day" "Dark Street lights on" "Dusk/Dawn" "Unk." ...
##  $ police_attend    : chr [1:77513] "Yes" "Yes" "Yes" "No" ...
##  $ road_geometry    : chr [1:77513] "Unknown" "Not at intersection" "Not at intersection" "Cross intersection" ...
##  $ severity         : chr [1:77513] "Other injury accident" "Other injury accident" "Other injury accident" "Other injury accident" ...
##  $ speed_zone       : chr [1:77513] "60 km/hr" "60 km/hr" "50 km/hr" "Not known" ...
##  $ run_offroad      : chr [1:77513] "No" "No" "No" "No" ...
##  $ node_id          : num [1:77513] 49465 272595 273609 44382 38531 ...
##  $ longitude        : num [1:77513] 145 -1 NA -1 -1 ...
##  $ latitude         : num [1:77513] -37.4 -1 NA -1 -1 ...
##  $ node_type        : chr [1:77513] "Intersection" "Non-Intersection" "Non-Intersection" "Intersection" ...
##  $ lga_name         : chr [1:77513] "MITCHELL" "WHITTLESEA" "KNOX" "WYNDHAM" ...
##  $ region_name      : chr [1:77513] "NORTHERN REGION" "METROPOLITAN NORTH WEST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN NORTH WEST REGION" ...
##  $ vicgrid_x        : num [1:77513] 2498217 2502935 2520154 2473719 2484074 ...
##  $ vicgrid_y        : num [1:77513] 2454870 2423005 2395622 2402357 2409124 ...
##  $ total_persons    : num [1:77513] 4 4 1 3 2 1 2 5 4 3 ...
##  $ inj_or_fatal     : num [1:77513] 1 3 1 1 1 1 1 2 1 1 ...
##  $ fatality         : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ seriousinjury    : num [1:77513] 0 0 0 0 0 1 1 1 0 0 ...
##  $ otherinjury      : num [1:77513] 1 3 1 1 1 0 0 1 1 1 ...
##  $ noninjured       : num [1:77513] 3 1 0 2 1 0 1 3 3 2 ...
##  $ males            : num [1:77513] 1 0 0 2 2 1 1 3 2 0 ...
##  $ females          : num [1:77513] 3 3 1 1 0 0 1 2 2 3 ...
##  $ bicyclist        : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ passenger        : num [1:77513] 2 2 0 1 0 0 0 2 2 1 ...
##  $ driver           : num [1:77513] 2 2 1 2 2 0 2 3 2 2 ...
##  $ pedestrian       : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ pillion          : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ motorist         : num [1:77513] 0 0 0 0 0 1 0 0 0 0 ...
##  $ unknown          : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ped_cyclist_5_12 : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ped_cyclist_13_18: num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ old_pedestrian   : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ old_driver       : num [1:77513] 0 0 0 0 1 0 0 0 0 0 ...
##  $ young_driver     : num [1:77513] 0 1 0 0 0 0 0 0 1 0 ...
##  $ alcohol_related  : chr [1:77513] "No" "No" "No" "No" ...
##  $ unlicencsed      : num [1:77513] 1 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_vehicles   : num [1:77513] 2 2 1 2 2 1 2 3 2 2 ...
##  $ heavyvehicle     : num [1:77513] 0 0 0 0 1 0 0 0 0 0 ...
##  $ passengervehicle : num [1:77513] 2 2 1 2 1 0 1 3 2 2 ...
##  $ motorcycle       : num [1:77513] 0 0 0 0 0 1 0 0 0 0 ...
##  $ publicvehicle    : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
##  $ deg_urban_name   : chr [1:77513] "TOWNS" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
##  $ deg_urban_all    : chr [1:77513] "TOWNS" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
##  $ lga_name_all     : chr [1:77513] "MITCHELL" "WHITTLESEA" "KNOX" "WYNDHAM" ...
##  $ region_name_all  : chr [1:77513] "NORTHERN REGION" "METROPOLITAN NORTH WEST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN NORTH WEST REGION" ...
##  $ srns             : chr [1:77513] "B" NA NA "C" ...
##  $ srns_all         : chr [1:77513] "B" NA NA "C" ...
##  $ rma              : chr [1:77513] "Arterial Highway" "Arterial Other" "Local Road" "Arterial Other" ...
##  $ rma_all          : chr [1:77513] "Local Road,Arterial Highway" "Arterial Other" "Local Road" "Arterial Other,Local Road" ...
##  $ divided          : chr [1:77513] "Undivided" "Undivided" "Undivided" "Divided" ...
##  $ divided_all      : chr [1:77513] "Undivided" "Undivided" "Undivided" "Undivided,Divided" ...
##  $ stat_div_name    : chr [1:77513] "Country" "Metro" "Metro" "Metro" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X = col_double(),
##   ..   Y = col_double(),
##   ..   OBJECTID = col_double(),
##   ..   ACCIDENT_NO = col_character(),
##   ..   ABS_CODE = col_character(),
##   ..   ACCIDENT_STATUS = col_character(),
##   ..   ACCIDENT_DATE = col_date(format = ""),
##   ..   ACCIDENT_TIME = col_time(format = ""),
##   ..   ALCOHOLTIME = col_character(),
##   ..   ACCIDENT_TYPE = col_character(),
##   ..   DAY_OF_WEEK = col_character(),
##   ..   DCA_CODE = col_character(),
##   ..   HIT_RUN_FLAG = col_character(),
##   ..   LIGHT_CONDITION = col_character(),
##   ..   POLICE_ATTEND = col_character(),
##   ..   ROAD_GEOMETRY = col_character(),
##   ..   SEVERITY = col_character(),
##   ..   SPEED_ZONE = col_character(),
##   ..   RUN_OFFROAD = col_character(),
##   ..   NODE_ID = col_double(),
##   ..   LONGITUDE = col_double(),
##   ..   LATITUDE = col_double(),
##   ..   NODE_TYPE = col_character(),
##   ..   LGA_NAME = col_character(),
##   ..   REGION_NAME = col_character(),
##   ..   VICGRID_X = col_double(),
##   ..   VICGRID_Y = col_double(),
##   ..   TOTAL_PERSONS = col_double(),
##   ..   INJ_OR_FATAL = col_double(),
##   ..   FATALITY = col_double(),
##   ..   SERIOUSINJURY = col_double(),
##   ..   OTHERINJURY = col_double(),
##   ..   NONINJURED = col_double(),
##   ..   MALES = col_double(),
##   ..   FEMALES = col_double(),
##   ..   BICYCLIST = col_double(),
##   ..   PASSENGER = col_double(),
##   ..   DRIVER = col_double(),
##   ..   PEDESTRIAN = col_double(),
##   ..   PILLION = col_double(),
##   ..   MOTORIST = col_double(),
##   ..   UNKNOWN = col_double(),
##   ..   PED_CYCLIST_5_12 = col_double(),
##   ..   PED_CYCLIST_13_18 = col_double(),
##   ..   OLD_PEDESTRIAN = col_double(),
##   ..   OLD_DRIVER = col_double(),
##   ..   YOUNG_DRIVER = col_double(),
##   ..   ALCOHOL_RELATED = col_character(),
##   ..   UNLICENCSED = col_double(),
##   ..   NO_OF_VEHICLES = col_double(),
##   ..   HEAVYVEHICLE = col_double(),
##   ..   PASSENGERVEHICLE = col_double(),
##   ..   MOTORCYCLE = col_double(),
##   ..   PUBLICVEHICLE = col_double(),
##   ..   DEG_URBAN_NAME = col_character(),
##   ..   DEG_URBAN_ALL = col_character(),
##   ..   LGA_NAME_ALL = col_character(),
##   ..   REGION_NAME_ALL = col_character(),
##   ..   SRNS = col_character(),
##   ..   SRNS_ALL = col_character(),
##   ..   RMA = col_character(),
##   ..   RMA_ALL = col_character(),
##   ..   DIVIDED = col_character(),
##   ..   DIVIDED_ALL = col_character(),
##   ..   STAT_DIV_NAME = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(crashes_data)
## [1] 77513    65

The Road Crashes data for Victoria from July 2014 to June 2019 was downloaded from Victoria Department of Transport Open Data Hub.. The data is provided by VicRoads for educational and research purposes. The data was downloaded in csv format and it contains a single file. The dataset has 77513 observations and 65 features. The features include categorical and numerical variables. I get the impression that great deal of effort was made to collect this data and as a result, for most part it looks thorough and accurate. From the high level perspective, I observed that number of road crashes have been steadily reduced in last 3-4 years, vast majority of the accidents though still happen in the rush hours and concentrated in two Metropolitan regions. Due to the time constraints this report will analyse the Victorian fatal and injury crash data based on these key variables - accident_date, accident_time,alcoholtime, accident_type, day_of_week, light_condition, severity, speed_zone, total_persons, longitude, latitude and region_name.

# selecting only the variables necessary for this analysis
sel_crashes_data <- crashes_data %>% 
  select(c("accident_date", "accident_time", "alcoholtime", 
           "accident_type", "day_of_week", "light_condition", "severity", 
           "speed_zone", "total_persons", "longitude", "latitude", 
           "region_name"))

Data exploration and visualisation

As part of my research, I traversed through 14 non-identical plots but only present the most relevant ones in this report. To start with I will run summary function to show a set of descriptive statistics for every variable.

# extract summary data of individual objects
summary(sel_crashes_data)
##  accident_date        accident_time     alcoholtime        accident_type     
##  Min.   :2014-07-01   Length:77513      Length:77513       Length:77513      
##  1st Qu.:2015-08-27   Class1:hms        Class :character   Class :character  
##  Median :2016-10-11   Class2:difftime   Mode  :character   Mode  :character  
##  Mean   :2016-11-14   Mode  :numeric                                         
##  3rd Qu.:2018-02-04                                                          
##  Max.   :2019-06-30                                                          
##                                                                              
##  day_of_week        light_condition      severity          speed_zone       
##  Length:77513       Length:77513       Length:77513       Length:77513      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  total_persons      longitude        latitude      region_name       
##  Min.   : 1.000   Min.   : -1.0   Min.   :-39.02   Length:77513      
##  1st Qu.: 2.000   1st Qu.:144.8   1st Qu.:-37.95   Class :character  
##  Median : 2.000   Median :145.0   Median :-37.81   Mode  :character  
##  Mean   : 2.346   Mean   :139.7   Mean   :-36.47                     
##  3rd Qu.: 3.000   3rd Qu.:145.2   3rd Qu.:-37.67                     
##  Max.   :89.000   Max.   :149.8   Max.   : -1.00                     
##  NA's   :1        NA's   :1106    NA's   :1106
# get the number of missing values per column
colSums(is.na(sel_crashes_data))
##   accident_date   accident_time     alcoholtime   accident_type     day_of_week 
##               0               0               0               0            2360 
## light_condition        severity      speed_zone   total_persons       longitude 
##               0               1               0               1            1106 
##        latitude     region_name 
##            1106             600

From running above function, we see 2360 values missing in day_of_week, 600 values missing in region_name and just 1 each missing in few other columns. I can handle these missing values by just dropping all the rows with NA’s since they represent very low percentage (2360/77000 = 3.06%) of overall data. For this analysis though I will impute the missing values by filling the columns with mean/mode values to make use of all the data and get better distribution.

# imputing null values with median value for numerical columns and mode value for categorical columns
sel_crashes_data[] <- lapply(sel_crashes_data, function(x) {
  if (class(x) == "character") {
    xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  x[is.na(x)] <- xmode
  }
  else {
     x[is.na(x)] <- median(x, na.rm = TRUE)
  }
  x
})
## Warning in if (class(x) == "character") {: the condition has length > 1 and only
## the first element will be used
# Checking the missing data in each column
vis_miss(sel_crashes_data, 'warn_large_data' = FALSE)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.

If we check for missing values using vis_miss function from visdat library, we see there is no missing values exist in the data.

Now I will look for the data points that are really far from the rest of the data points, also called outliers by drawing a boxplot. Removing the outliers will reduce skewness of the data and improve statistical significance.

# checking for outliers
boxplot(sel_crashes_data[,c("total_persons")])

It seems like there is a lot of outliers represented by overlapping points over the box near the ‘0’ line. One particular value in total_persons though really distanced itself from the rest. I will remove this value by filtering the dataset and keep the rest to ensure data integrity.

# removing outliers from the data
sel_crashes_data <- sel_crashes_data %>% 
  filter(total_persons < 89)
# after removing the outliers
boxplot(sel_crashes_data[,c("total_persons")])

From the above boxplot, it is pretty evident that the extreme value has now been removed.

Now I will do further exploration and visualisation of the data for analysis based on the questions defined in problem description.

How has accident frequency changed over the years?

I will first extract ‘year’ from ‘accident_date’ and then use aggregate functions to summarise ‘number_of_accidents’ in a tibble format. Then create a line plot using geom_line function from ggplot2 library to show the distribution of the accidents over the years. In this instance, line plot will be really useful to show trends in number of accidents from 2014 to 2019. By mapping year to x axis and number of accidents to y axis, we will see yearly trend in the accident number.

# extracting year from accident_date and creating a separate 'year' column
sel_crashes_data <-  sel_crashes_data %>% 
  mutate(year = format(sel_crashes_data$accident_date, format = "%Y"))
# using aggregate functions to summarise the number of accidents 
# and assigning it to the variable 'accidents_by_year'
accidents_by_year <- sel_crashes_data %>% 
  group_by(year) %>% 
  summarise(number_of_accidents = n())
accidents_by_year$year <-  as.numeric(accidents_by_year$year)
accidents_by_year
## # A tibble: 6 × 2
##    year number_of_accidents
##   <dbl>               <int>
## 1  2014                8321
## 2  2015               17020
## 3  2016               17109
## 4  2017               14459
## 5  2018               13682
## 6  2019                6921
# visualising the number of accidents by each year using geom_line
year_plot <- ggplot(accidents_by_year, aes(x = year, y = number_of_accidents)) +
             # white background with grid lines
             theme_bw() +
             geom_line() +
             labs(x = "Year", 
                  y = "Number of Accidents", 
                  title = "Distribution of Road Crashes By Year") +
             # center aligning the plot title
             theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_year_plot <- ggplotly(year_plot)
interactive_year_plot

Was a particular day of the week more prone to accidents?

To generate relevant data, I will use aggregate functions to summarise ‘number_of_accidents’ based on the variable ‘day_of_week’. Here, We have a categorical variable in ‘day_of_week’ and numeric variable in ‘number_of_accidents’. So a barplot will be quite useful to display this relationship. We will find if any particular day/s of the week is/are more prone to the accidents by mapping day of week to x axis and number of accidents to y axis.

# using aggregate functions to summarise the number of accidents 
# based on day of week and assigning it to the variable 'accidents_by_day'
accidents_by_day <- sel_crashes_data %>% 
  group_by(day_of_week) %>% 
  summarise(number_of_accidents = n())
accidents_by_day
## # A tibble: 7 × 2
##   day_of_week number_of_accidents
##   <chr>                     <int>
## 1 Friday                    14182
## 2 Monday                    10558
## 3 Saturday                   8509
## 4 Sunday                    10170
## 5 Thursday                  11675
## 6 Tuesday                   11033
## 7 Wednesday                 11385
# visualising the number of accidents by day of week using geom_bar
day_plot <- ggplot(accidents_by_day, aes(x = day_of_week, y = number_of_accidents)) +
            # white background with grid lines          
            theme_bw() +
            geom_bar(stat = "identity", color = "white") +
            labs(x = "Day of Week", 
                 y = "Number of Accidents", 
                 title = "Distribution of Road Crashes By Day of Week") +
            # center aligning the plot title
            theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_day_plot <- ggplotly(day_plot)
interactive_day_plot

Was a particular time of the day(morning rush/evening rush) more prone to accidents?

In order to create a tibble of hour and number of accidents, first I need to extract hour from the variable ‘accident_time’. Then use aggregate functions to calculate the number of accidents recorded by each hour. I will mostly use bar plots for the analysis since we have many categorical variables in the dataset. We can find out how much of an impact rush hours has on the number of accidents by mapping hour to x axis and number of accidents to y axis.

# creating new 'hour' column
sel_crashes_data$hour <- format(strptime(sel_crashes_data$accident_time, "%H:%M:%S"), "%H")

# using aggregate functions to summarise the number of accidents 
# by the hour and assigning it to the variable 'accidents_by_hour'
accidents_by_hour <- sel_crashes_data %>% 
  group_by(hour) %>% 
  summarise(number_of_accidents = n())
accidents_by_hour
## # A tibble: 24 × 2
##    hour  number_of_accidents
##    <chr>               <int>
##  1 00                    981
##  2 01                    853
##  3 02                    669
##  4 03                    585
##  5 04                    581
##  6 05                   1042
##  7 06                   2138
##  8 07                   3292
##  9 08                   4841
## 10 09                   4139
## # … with 14 more rows
# visualising the number of accidents by hour using geom_bar
hour_plot <- ggplot(accidents_by_hour, aes(x = hour, y = number_of_accidents)) +
             # white background with grid lines
             theme_bw() +
             geom_bar(stat = "identity") +
             labs(x = "Hour", 
                  y = "Number of Accidents", 
                  title = "Distribution of Road Crashes By Hour") +
             # center aligning the plot title
             theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_hour_plot <- ggplotly(hour_plot)
interactive_hour_plot

How are the accidents distributed across regions? How does it trend over the years?

First, filter out the ‘NA’ values from ‘region_name’ to better represent the available data. Next step I map region name to x axis and assign ‘year’ to fill on a bar chart to see the relative proportion of yearly accidents by the region name. Position adjustment is set to ‘dodge’ to arrange elements side by side. This will allow us to see the yearly trends in number of accidents for each individual region.

# using filter to remove 'region_name' with 'NA' values
sel_crashes_data <- sel_crashes_data %>% 
  filter(region_name != "NA")

# visualising the number of accidents by year and region using geom_bar
region_plot <- ggplot(sel_crashes_data, aes(x = region_name, fill = year)) +
               # white background with grid lines
               theme_bw() +
               geom_bar(position = "dodge") +
               # adapting a colour palette particularly suited to display discreet values
               scale_fill_brewer(palette = "Set2") +
               labs(x = "Region", 
                    y = "Number of Accidents", 
                    title = "Distribution of Road Crashes By Year and Region",
                    fill = "  Year") +
               # center aligning the plot title
               theme(plot.title = element_text(hjust = 0.5)) +
               # rotating and spacing x label
               theme(axis.text.x = element_text(angle = 30, vjust=0.5, hjust=1))
# adding interactive elements to the plot
interactive_region_plot <- ggplotly(region_plot)
interactive_region_plot

How are alcohol time and overall visibility contributed to the accidents?

I will again use filter function to remove insignificant values from ‘light_condition’ to get more coherent visualisation outcome. Then map light condition to x axis and assign ‘alcoholtime’ to fill on a bar chart to see the relative proportion of accidents by the light condition and alcohol time. Alcohol times are defined as the following: Monday - Thursday 00:00 - 06:00 hours & 18:00 - 23:59 hours, Friday 00:00 - 06:00 hours & 16:00 - 23:59 hours, Saturday 00:00 - 08:00 hours & 14:00 - 23:59 hours, Sunday 00:00 - 10:00 hours & 16:00 - 23:59 hours. I hope to find any correlation between light condition/alcohol time and number of accidents.

# using filter to remove 'light_condition' with 'Unk' and 'Dark Street lights unknown' values
sel_crashes_data <- sel_crashes_data %>% 
  filter(light_condition != "Unk.", light_condition != "Dark Street lights unknown")

# visualising the factors of alcohol time and light condition to number of accidents using geom_bar
light_plot <- ggplot(sel_crashes_data, aes(x = light_condition, fill = alcoholtime)) +
              # white background with grid lines
              theme_bw() +
              geom_bar() +
              # dividing plot into rectangular subplots
              facet_wrap(~ alcoholtime) +
              # adapting a colour palette particularly suited to display discreet values
              scale_fill_brewer(palette = "Set2") +
              labs(x = "Light Condition", 
                   y = "Number of Accidents", 
                   title = "Alcohol Time and Light Condition Factor to Vehicle Crashes",
                   fill = "  Alcohol") +
              # center aligning the plot title
              theme(plot.title = element_text(hjust = 0.5)) +
              # rotating and spacing x label
              theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=1)) +
              # set legend position to top
              theme(legend.position = "top")
# adding interactive elements to the plot
interactive_light_plot <- ggplotly(light_plot)
interactive_light_plot

How has the various speed zones affected the severity of the accidents?

We can find the impact of different speed limits on the severity of the accidents by producing a bar plot with aesthetic mapping of severity to x axis and assigning ‘speed_zone’ to fill. This should give us a clear picture if higher speed limit has any correlation with the severity of the accidents.

# using filter to remove insignificant values
sel_crashes_data <- sel_crashes_data %>% 
  filter(speed_zone != "Camping grounds or off road", speed_zone != "Not known", 
         speed_zone != "Other speed limit", severity != "Non injury accident",
         severity != "NA")

# visualising accident severity by different speed zones using geom_bar
speed_plot <- ggplot(sel_crashes_data, aes(x = severity, fill = speed_zone)) +
              # white background with grid lines
              theme_bw() +
              geom_bar(position = "dodge") +
              # adapting a colour palette particularly suited to display discreet values
              scale_fill_brewer(palette = "Set3") +
              labs(x = "Severity", 
                   y = "Number of Accidents", 
                   title = "Severity Distribution by Speed Limit",
                   fill = "   Speed Zone") +
              theme(legend.position = "bottom") +
              # center aligning the plot title
              theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_speed_plot <- ggplotly(speed_plot)
interactive_speed_plot

How does the collision type affect severity of the accidents?

To help us finding the answer for this, I map ‘severity’ to x axis and assign ‘accident_type’ to aesthetics parameter ‘fill’ on a bar chart. Position adjustment is set to ‘dodge’ to arrange elements side by side. This should show us accident severity relative to collusion type.

# visualising accident severity by different speed zones using geom_bar
accident_type_plot <- ggplot(sel_crashes_data, aes(x = severity, fill = accident_type)) +
                      # white background with grid lines
                      theme_bw() +
                      geom_bar(position = "dodge") +
                      # adapting a colour palette particularly suited to display discreet values
                      scale_fill_brewer(palette = "Set3") +
                      labs(x = "Severity", 
                           y = "Number of Accidents", 
                           title = "Accident Severity by Collision Type",
                           fill = "  Accident Type") +
                      # center aligning the plot title
                      theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_accident_type_plot <- ggplotly(accident_type_plot)
interactive_accident_type_plot

Results

These are the outcome of my analysis of ‘Victoria Crashes Data’ through the exploration and visualisation processes above:

Conclusion

Data collection for ‘Road Crashes Victoria’ began in July 2014 and regularly being updated ever since. I did an analysis from 2014 to 2019 where data is available for the entire year. The following trends were observed in the Victoria Road Crashes data:

References

Plotly. (2021). Plotly R open source graphing library. https://plotly.com/r/

R Studio. (2021). RStudio cheatsheets. https://www.rstudio.com/resources/cheatsheets/

Scientific Research. (2021). Road Traffic Crash Data: An overview on sources, problems and collection methods. https://www.scirp.org/journal/paperinformation.aspx?paperid=75975

Tierney, N. (2021). Getting started with naniar. http://naniar.njtierney.com/articles/getting-started-w-naniar.html#introduction

Tierney, N. (2020). RMarkdown for scientists. https://rmd4sci.njtierney.com/

Victoria Department of Transport Open Data Hub. (2021). Road crashes for five years - Victoria. https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about

Wickham, H. (2nd edition). Elegent graphics for data analysis. Springer.

Wickham, H. & Grolemund, G. (2017). R for data science. O’Reilly Media , Inc..